In this article it is shown how optimized and dedicated microarray experiments can be used to 
study the thermodynamics of DNA hybridization for a large number of different conformations in 
a highly parallel fashion. In particular, free energy penalties for mismatches are obtained in two 
independent ways and are shown to be correlated with values from melting experiments in solution 
reported in the literature. The additivity principle, which is at the basis of the nearest-neighbor 
model, and according to which the penalty for two isolated mismatches is equal to the sum of the 
independent penalties, is thoroughly tested. Additivity is shown to break down for a mismatch 
distance below 5 nt. The behavior of mismatches in the vicinity of the helix edges, and the behavior 
of tandem mismatches are also investigated. Finally, some thermodynamic outlying sequences are 
observed and highlighted. These sequences contain combinations of GA mismatches. The analysis 
of the microarray data reported in this article provides new insights on the DNA hybridization 
parameters and can help to increase the accuracy of hybridization-based technologies. 



I. INTRODUCTION 

Hybridization of single-stranded nucleic acids to form
a duplex is a reversible chemical reaction, which is at
the basis of many processes and techniques currently
used in biotechnology, for instance PCR [1]. Due to
its central importance, hybridization has been intensively
studied in experiments (focusing on the thermodynamics
[2, 3] or kinetics of the process) and also in computer
simulations [4].

The thermodynamics of DNA hybridization is usually
described by the nearest-neighbor (NN) model [5]. This
model assumes that the free energy of a duplex can be
expressed as a sum of dinucleotide stability parameters; it is
therefore based on the principle of additivity. From the
NN parameters one can, for instance, estimate melting
temperatures, compute melting curves and predict the
secondary structures into which RNA molecules fold [6]. In
the folding problem, many different local conformations
arise, such as single nucleotide mismatches, bulges, stem-loop
structures, etc. Describing these conformations in the
framework of the NN model is very challenging and
requires a large number of parameters [6]. However, only
a limited number of them have been measured directly
in experiments [8]. In addition, one may also wonder
whether additivity holds in such cases. To investigate a
large number of different conformations, it would be very
advantageous to have access to high-throughput
measurements, provided that they are sufficiently accurate.

In this article, we quantitatively determine free energy
penalties for mismatches using microarray data obtained
from a set of optimized and dedicated experiments. In
DNA microarrays, several thousand different sequences
can be spotted on a surface, hence a large number of
hybridization reactions take place simultaneously. We use
two different approaches. The first is based on a linear
regression of a large set of experimental data points
(≈ 1000) to fit 58 NN dinucleotide parameters. The second
method relies on the computation of the logarithm
of the ratios of fluorescence intensities measured from
different spots of the arrays. We show that both methods
provide highly correlated sets of NN parameters. In addition,
the second approach allows us to probe the limitations
of the NN model. It is found that when two mismatches
are closer than 5 nt additivity breaks down and the free
energy of the duplex is not equal to the sum of the two
separate contributions of isolated mismatches. We also
quantify the influence of mismatches close to the edge of
the double helix and show that the free energy penalty is
much weaker in those cases. Overall, this work provides
new insights into DNA hybridization thermodynamics and
can help to increase the accuracy of hybridization-based
technologies.



II. MATERIALS AND METHODS 

The experiments were performed on custom Agilent
arrays, following a standard protocol, which is discussed
in [9]. In each experiment, a single target sequence in
solution was hybridized at concentrations ranging typically
from ≈ 10 pM to ≈ 2 nM. In total, three
$t_1$   5'-CTGGTCTTAGATGCAGCGACTGTTT-poly(A)-3'-Cy3
$t_2$   5'-CTGCACAATTCCGGAGCTATGAATT-poly(A)-3'-Cy3
$t_3$   5'-AATAATGCTCATTAGGCACCGGGAA-poly(A)-3'-Cy3

TABLE I. Target sequences used in the experiments. At the 3' side of each sequence a 20-mer poly(A) is attached, terminating
with a Cy3 fluorophore. The targets were selected from Optimal Design criteria [2] (Supplementary Data). Each target is
hybridized separately on specific microarrays containing mismatched probes with up to two mismatches with respect to the
target. Note that $t_1$ and $t_2$ share a common triplet of nucleotides AGC at the same sequence position (in bold characters;
nucleotides 15 to 17 counted from the 5' end). The mismatches centered around this triplet will be discussed in some detail
in the 'Results' section.



different sets of experiments were performed using the
target sequences shown in Table I. These sequences were
selected from 25-mer human DNA sequences using Optimal
Design methods [2]. The theory of Optimal Design provides
criteria for selecting an optimal set of measurements,
which minimizes the uncertainties in the parameters of a
statistical model (see Supplementary Data).

From the targets of Table I three different microarrays
were designed and used for hybridization to either $t_1$, $t_2$
or $t_3$. Each microarray contains probes with either zero,
one or two mismatches with respect to the given target,
covering all possible mismatch combinations. In a stretch
of $N$ nucleotides there can be $3N$ single mismatch probes
and $9N(N-1)/2$ double mismatch probes. For $N = 25$
this gives, including the perfect match probe, 2776 different
sequences in total, which were spotted on the microarray.
The sequences were replicated 15 times to fill up completely
a 44K custom Agilent array. Another design was also used,
in which mismatches have a minimal distance of 4 nt from
the border and a minimal relative distance of 5 nt. In this
case the total number of sequences is 646. These sequences
were replicated 23 times to fill a 15K custom array. We
considered hybridizing sequences of 25 nucleotides because
in previous studies [11] these sequences were found to attain
thermodynamic equilibrium after ≈ 3 h of hybridization (in
the experiments the hybridization time is 17 h, hence
thermodynamic equilibrium is guaranteed). A hybridization
experiment provides a large number of fluorescence
intensities: the highest intensity is from spots containing
perfect match sequences, whereas the intensity decreases
with the number and type of mismatches. The reduction
of the intensity provides an estimate of the hybridization
free energy. We use two different methods to obtain the
NN parameters, as discussed in the next sections.
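
The counting argument above can be checked with a short Python sketch. It is only an illustration and not part of the original analysis: the function names are ours, and the 25-letter string is used merely as an example sequence of the right length, not as an actual probe.

```python
from itertools import combinations, product

BASES = "ACGT"

def count_probes(n):
    """Closed-form count: 1 perfect match + 3n singles + 9n(n-1)/2 doubles."""
    singles = 3 * n
    doubles = 9 * n * (n - 1) // 2
    return 1 + singles + doubles

def enumerate_variants(seq, max_mismatches=2):
    """Brute-force set of sequences differing from seq in at most max_mismatches positions."""
    variants = {seq}
    for k in range(1, max_mismatches + 1):
        for positions in combinations(range(len(seq)), k):
            # three alternative nucleotides at every substituted position
            choices = [[b for b in BASES if b != seq[p]] for p in positions]
            for subs in product(*choices):
                letters = list(seq)
                for p, b in zip(positions, subs):
                    letters[p] = b
                variants.add("".join(letters))
    return variants

example_25mer = "CTGGTCTTAGATGCAGCGACTGTTT"  # any 25-letter sequence gives the same count
print(count_probes(25), len(enumerate_variants(example_25mer)))
```

Both the closed formula and the explicit enumeration should give the 2776 sequences quoted in the text.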



III. RESULTS 

A. Nearest-neighbor parameters from linear 
regression 

Equilibrium thermodynamics predicts that the measured
fluorescence intensity from a spot $i$ equals:

$$ I_i = I_0 + A c \, e^{-\Delta G_i / RT} \tag{1} $$

where $\Delta G_i$ is the hybridization free energy between the
target sequence and a probe sequence in $i$, $A$ is a parameter
which sets the intensity scale, $c$ the target concentration,
$R$ the gas constant and $T$ the temperature
(experiments are performed at $T$ = 65°C = 338 K, which
is the value of the temperature used in the rest of the
analysis). Although the data analyzed are background-subtracted
by the Agilent scanner, there always remains
some small aspecific signal, which we denote by $I_0$
in Equation (1). In the experiments $I_i$ is obtained from
the average over typically 15 replicated
spots. One should note that Equation (1) is valid at
sufficiently low target concentrations, i.e. when only a limited
fraction of probes is hybridized in a spot, hence far
from chemical saturation. On the other hand, at very low
concentrations, the specific signal, i.e. the second term
in Equation (1), can become comparable to $I_0$. Therefore,
for the analysis of the data we restricted ourselves to
intermediate concentrations and intensities for which we
explicitly verified that the intensities scale linearly with
concentration, as predicted by Equation (1) (more details
can be found in the Supplementary Data). In the
intensity scale of the experiments $I_0 \approx 1$, whereas the
values used in the analysis are $I_i > 10$. In practice, the
large majority of the intensities in experiments with target
concentration $c = 100$ pM or higher are above this
threshold value.
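
The low-concentration isotherm described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the intensity scale, the background and the value of the hybridization free energy are hypothetical numbers chosen by us, and only the temperature (338 K) is taken from the text.

```python
import math

R_KCAL = 1.987e-3   # gas constant in kcal/(mol K)
T_KELVIN = 338.0    # 65 degrees C, the experimental temperature
RT = R_KCAL * T_KELVIN

def intensity(conc, delta_g, scale=1.0, background=1.0):
    """Isotherm far from saturation: I = I0 + A * c * exp(-dG / RT).

    conc is in mol/l, delta_g in kcal/mol (negative for a stable duplex);
    scale (A) and background (I0) are hypothetical values.
    """
    return background + scale * conc * math.exp(-delta_g / RT)

# Doubling the concentration doubles the specific signal I - I0.
low = intensity(1e-10, -20.0)
high = intensity(2e-10, -20.0)
print((high - 1.0) / (low - 1.0))
```

In this regime the specific part of the signal is proportional to the concentration, which is the linear scaling used in the text to select the data.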

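In the following, we will consider the logarithm of the
intensities measured with respect to the perfect match
(PM) intensity. Using Equation (1), for $I_i \gg I_0$ we get:

$$ y_i \equiv \ln I_i - \ln I_{PM} = -\frac{\Delta G_i - \Delta G_{PM}}{RT} \equiv -\frac{\Delta\Delta G_i}{RT} \tag{2} $$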

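which defines the free energy penalty $\Delta\Delta G_i$ of probe $i$ with
respect to the perfectly matching probe. This penalty
can be expressed as a sum of NN dinucleotide parameters.
Consider, for instance, the example of a probe $i$ with
a single mismatch of type A with respect to the target
nucleotide G and with neighboring nucleotides G and T.
We have:

$$
\Delta\Delta G_i
= \Delta G\begin{pmatrix}\ldots\mathrm{G\underline{A}T}\ldots\\ \ldots\mathrm{C\underline{G}A}\ldots\end{pmatrix}
- \Delta G\begin{pmatrix}\ldots\mathrm{GCT}\ldots\\ \ldots\mathrm{CGA}\ldots\end{pmatrix}
= \Delta G\begin{pmatrix}\mathrm{G\underline{A}}\\ \mathrm{C\underline{G}}\end{pmatrix}
+ \Delta G\begin{pmatrix}\mathrm{\underline{A}T}\\ \mathrm{\underline{G}A}\end{pmatrix}
- \Delta G\begin{pmatrix}\mathrm{GC}\\ \mathrm{CG}\end{pmatrix}
- \Delta G\begin{pmatrix}\mathrm{CT}\\ \mathrm{GA}\end{pmatrix}
\equiv \Delta\Delta G\begin{pmatrix}\mathrm{G\underline{A}T}\\ \mathrm{C\underline{G}A}\end{pmatrix}
\tag{3}
$$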
FIG. 1. Plot of the intensities at concentration $c = 100$ pM
from the experiments using hybridization of targets $t_1$, $t_2$ or
$t_3$, as a function of the $\Delta\Delta G$ parameters obtained from
least-squares minimization. The data agree well with the hybridization
isotherm given in Equation (2), shown as a straight line in the
linear-log scale.



We use the following notation: the target sequence is the
bottom strand and the probe sequence, which is oriented
from 5' to 3', is the top strand. This example corresponds
to target $t_1$ or $t_2$ at position 10, counting from the 3' end
(the triplet of nucleotides is indicated in bold in Table I).
In Equation (3) $\Delta\Delta G$ is defined as the free energy
penalty of an isolated mismatch in a DNA duplex.
This penalty is expected to be a local effect. In the NN
model this locality is inherent: the dots in Equation (3)
indicate identical nucleotides in the two sequences, their
contribution cancels out and leaves per isolated mismatch
only four dinucleotide parameters around the mismatch
position. There are in total only 58 such dinucleotide
parameters: 10 perfect match parameters and 48 single
mismatch parameters (taking into account symmetries).
The dinucleotide parameters are not directly experimentally
accessible and are not unique [12], e.g. they can be
shifted by some constant value such that the physically
accessible $\Delta\Delta G$ remains unchanged (see Supplementary
Data).
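
The cancellation described above, which leaves four dinucleotide terms per isolated mismatch, can be made explicit with a small Python sketch. It is an illustration only: the function names are ours, the five-nucleotide window around position 10 of $t_1$ is our choice, and symmetry-equivalent dinucleotide labels are not merged.

```python
def dinucleotides(probe, target):
    """Stacked dinucleotide blocks of a duplex.

    probe is written 5'->3' (top strand), target 3'->5' (bottom strand),
    so that letters at the same index face each other.
    """
    return [(probe[k:k + 2], target[k:k + 2]) for k in range(len(probe) - 1)]

def penalty_terms(probe_mm, probe_pm, target):
    """Coefficients of the dinucleotide parameters in dG(mismatch) - dG(PM)."""
    terms = {}
    for block in dinucleotides(probe_mm, target):
        terms[block] = terms.get(block, 0) + 1
    for block in dinucleotides(probe_pm, target):
        terms[block] = terms.get(block, 0) - 1
    # blocks shared by both duplexes cancel and are dropped
    return {block: coeff for block, coeff in terms.items() if coeff != 0}

target_window = "GCGAC"    # 3'->5', contains the triplet CGA
probe_pm_window = "CGCTG"  # 5'->3', perfect complement
probe_mm_window = "CGATG"  # 5'->3', A placed opposite the target G
print(penalty_terms(probe_mm_window, probe_pm_window, target_window))
```

Only the two blocks containing the mismatch (coefficient +1) and the two perfect match blocks they replace (coefficient -1) survive; the flanking blocks cancel.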

Equations (2) and (3) define a linear problem: each
measured $y_i$ can be expressed as a linear combination of
dinucleotide parameters. In order to extract the parameters
from the data we combined the results of the three
experiments and performed a least-squares minimization
of Equation (2). Mismatches closer than five sites from
the helix edges were excluded from the analysis, as well
as pairs of mismatches with a distance smaller than 5 nt.

The 58 adjustable parameters were fitted on a set of
about a thousand experimental data points above the
intensity threshold. The fitted parameters were then applied
to produce the plot shown in Figure 1 for all available
intensities of the experiments in which either sequence $t_1$,
$t_2$ or $t_3$ was hybridized on its corresponding microarray
at a concentration of $c = 100$ pM. The data are plotted
as a function of the unique $\Delta\Delta G$ for triplets defined as
in Equation (3). We note that there is very good agreement
between the data and the thermodynamic model
of Equation (1). The experiments follow the equilibrium
isotherm (a straight line with a slope of magnitude $1/RT$) for
a range of intensities of more than four orders of magnitude.
A previous study [9] in which hybridizing strands
were 30-mers did not provide a single straight line in a
$\ln I$ versus $\Delta\Delta G$ plot. Deviations due to lack of
thermodynamic equilibrium were observed in the high-intensity
ranges, as discussed in [11, 15].

FIG. 2. Plot of free energy penalties $\Delta\Delta G$ (in kcal/mol) for
triplets obtained from the microarray fit versus those from
hybridization in solution [8]. The central mismatching nucleotides of
the triplet (underlined in Equation (3)) are indicated in the
plot.

Further, it is important to note that we do not only find
internally consistent results, but that our microarray-derived
free energy parameters also correlate to a fair
degree with those reported in the literature for hybridization
in solution [8]. Figure 2 shows a correlation plot
of the free energy penalties (i.e. the $\Delta\Delta G$ defined as
in the example of Equation (3)) obtained from the
microarray data analysis and those from SantaLucia et al. [8].
The Spearman correlation coefficient is equal
to 0.855. This shows that free energy parameters
for DNA features measured by the presented microarray
approach also apply to thermodynamic properties in
solution. This opens the highly parallel microarray toolbox
for the study of the thermodynamics of DNA structures.
An example is discussed in the next section.
FIG. 3. Schematic representation of hybridizing strands in the
microarray experiment. From the appropriate ratios of intensities
measured from these spots, the free energy parameters
can be determined and the additivity principle can be tested.
As in the rest of the article the lower strand is the fixed target
sequence. The upper strand is the probe sequence. The
filled triangles denote mismatching nucleotides. In the four
examples from the top we show: (a) hybridization with a
PM probe, (b,c) hybridization with a single mismatch probe
where the mismatching nucleotides are $m$ and $n$ at positions
$x$ and $x + \Delta x$ respectively, (d) hybridization with a probe
carrying two mismatches. We use the notations $I_x^m$, $I_{x+\Delta x}^n$ and
$I_{x,x+\Delta x}^{mn}$ to denote the corresponding intensities measured in
the experiment.



FIG. 4. Parameter $\alpha$, the relative deviation from additivity,
from the experiment with target $t_1$, averaged over $x$, $m$ and $n$,
as a function of the distance $|\Delta x|$ between two mismatches.
The inset shows the plot with $\alpha$ in log scale.



B. Nearest-neighbor parameters from ratios of
intensities: probing additivity

The crucial assumption of the NN model is additivity
of local free energy contributions. We probe here the limits
of additivity of free energy penalties as a function of
the distance between two mismatches. We access the
free energy parameters by comparing ratios of intensities
measured from different spots in the microarray.

To this end, we combine microarray spots that contain
probes with zero, one or two mismatches with respect
to the target and we denote the location of the mismatch
by $x$ or $x + \Delta x$, as illustrated in Figure 3. The associated
free energy penalties can then be derived from the
intensity measurements as follows

$$ \Delta\Delta G_x^m = -RT \ln\left(\frac{I_x^m}{I_{PM}}\right) \tag{4} $$

$$ \Delta\Delta G_{x+\Delta x}^n = -RT \ln\left(\frac{I_{x+\Delta x}^n}{I_{PM}}\right) \tag{5} $$

$$ \Delta\Delta G_{x,x+\Delta x}^{mn} = -RT \ln\left(\frac{I_{x,x+\Delta x}^{mn}}{I_{PM}}\right) \tag{6} $$

in which the superscripts $m$ and $n$ represent the three
possible mismatching nucleotides at location $x$ and
$x + \Delta x$ respectively. If the additivity of the NN model holds,
the free energy penalty of Equation (6) should equal the
sum of the individual penalties of Equations (4) and (5).
To test this, we introduce

$$ \alpha_{x,x+\Delta x}^{mn} = \frac{\Delta\Delta G_x^m + \Delta\Delta G_{x+\Delta x}^n - \Delta\Delta G_{x,x+\Delta x}^{mn}}{\Delta\Delta G_x^m + \Delta\Delta G_{x+\Delta x}^n} \tag{7} $$



which measures the relative deviation from additivity.
Figure 4 shows the experimental results for $\alpha$,
averaged over $x$, $m$ and $n$, leaving $\alpha$ as a function
of the distance $|\Delta x|$ between two mismatches. From these
data, we notice that $\alpha$ is about zero when the
mismatches are separated by $\geq 5$ nt, but clearly positive
for smaller $|\Delta x|$. Apparently the free energy penalty
of two nearby mismatches is smaller than the sum of the
two individual contributions, resulting in a positive $\alpha$.
Furthermore, the inset of Figure 4 shows that the
relationship is linear in a semi-logarithmic plot, hence $\alpha$
decays exponentially with $|\Delta x|$. Note that at $\Delta x = 0$
only one mismatch is present, hence $m = n$ and $\alpha$ is
identical to 1/2 according to Equation (7). All these
observations result from direct measurement values,
containing no fitting parameters, and strongly suggest that
in double-stranded DNA, mismatches have a physical
interaction with each other which decays exponentially to
zero over a distance of about five nucleotides.
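
The additivity test can be sketched in a few lines of Python. This is an illustration under stated assumptions: the relative deviation is taken as one minus the ratio of the double-mismatch penalty to the sum of the two single-mismatch penalties (so that it is positive when the double penalty is smaller than the sum, and equal to 1/2 when both mismatches coincide, as stated in the text); the intensities are hypothetical numbers and the function names are ours.

```python
import math

RT = 1.987e-3 * 338.0  # kcal/mol at 65 degrees C

def penalty(i_spot, i_pm):
    """Free energy penalty from an intensity ratio: -RT ln(I / I_PM)."""
    return -RT * math.log(i_spot / i_pm)

def additivity_deviation(i_m, i_n, i_mn, i_pm):
    """Relative deviation from additivity for one pair of mismatches.

    i_m, i_n: intensities of the two single-mismatch spots,
    i_mn: intensity of the double-mismatch spot, i_pm: perfect match spot.
    """
    single_sum = penalty(i_m, i_pm) + penalty(i_n, i_pm)
    return 1.0 - penalty(i_mn, i_pm) / single_sum

i_pm = 1.0e4                                         # hypothetical PM intensity
print(additivity_deviation(500.0, 200.0, 10.0, i_pm))   # additive case
print(additivity_deviation(500.0, 200.0, 40.0, i_pm))   # double mismatch brighter than additive
```

In the first call the double-mismatch intensity equals the product of the single-mismatch intensities divided by the PM intensity, which is the additive case; in the second call the duplex with two mismatches is more stable than additivity predicts and the deviation is positive.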

These results set some limitations on the additivity
of the NN model. However, outside this interaction
region of 4 nt we expect the NN model to hold, i.e.
$\alpha$ should be zero and mismatches can be considered as
isolated. This can be explicitly checked in a very direct
way. When $\alpha = 0$ we get from Equation (7)

$$ \Delta\Delta G_x^m = \Delta\Delta G_{x,x+\Delta x}^{mn} - \Delta\Delta G_{x+\Delta x}^n . \tag{8} $$

The free energy penalty $\Delta\Delta G_x^m$ of a mismatch $m$ at
location $x$, which we will call the focus mismatch $(m, x)$,
can be estimated either directly using Equation (4) or
via a second mismatch $(n, x + \Delta x)$ using Equations (5)
and (6) for any choice of $n$ and $|\Delta x| > 4$. Hence, the free
energy penalty of the focus mismatch can be estimated
from measurements in many independent ways and they
should provide the same answer if additivity holds. Note
that, using Equations (5) and (6), $I_{PM}$ drops out of the
right-hand side of Equation (8).

Figure 5 illustrates how Equation (8) can be used to
estimate the $\Delta\Delta G$ using different combinations of $n$ and
$\Delta x$. In this specific example we consider $\Delta\Delta G_{10}^{A}$, which
corresponds both for target $t_1$ and $t_2$ to
$\Delta\Delta G\begin{pmatrix}\mathrm{G\underline{A}T}\\ \mathrm{C\underline{G}A}\end{pmatrix}$
(in the Supplementary Data, we show other examples
featuring additivity for different focus mismatches).

In the pane for target $t_2$, all the estimates of the free
energy penalty are close to each other: the 48+1 estimates
lie tightly around a median value, in this case
≈ 2.1 kcal/mol, indicated by the dotted line. The picture
in the right pane is a typical one, which we observe
for any focus mismatch $(m, x)$. This confirms that
additivity holds in the regime $|\Delta x| > 4$, i.e. when mismatches
are separated by > 4 nt. Moreover, it shows that the
microarray measurement is internally consistent. Secondly,
the left pane, i.e. experiment $t_1$, provides the same median
value for the free energy penalty, showing also the
robustness of the microarray approach to estimate free
energies of DNA structures. However, this figure was
chosen because it is atypical in the sense that one notices
two pronounced outlying values. They correspond to
sequences where both the focus mismatch and the second
mismatch are of type AG. Since they clearly deviate
from an otherwise nicely consistent picture, we believe
there must be an underlying physical reason for it. We will
come back to this point in the section where we discuss
thermodynamic outliers.

Note that with this second method we accessed values
for the free energy penalties of isolated mismatches
without using any multiple regression or fitting procedure:
we simply compared the ratios of intensities,
Equations (4)-(6), to get a consistent set of independent
estimates. The free energy penalties are then obtained
from the median over all data points. We compared the
free energy penalties obtained from this method (median)
with those obtained from linear regression as discussed
in the previous section. The two sets of data are well
correlated, with a Pearson correlation coefficient equal to
0.966 (see Supplementary Data). This correlation shows
the equivalence of the two approaches. In this analysis,
we restricted ourselves to mismatches in the bulk of the
sequence, i.e. $x$ is $\geq 5$ nt from the border. Closer to the
border we observe boundary effects, which are covered in
the next section.
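
The median method described above can be summarized by the following Python sketch. It is a hedged illustration: the function name and the synthetic intensities are ours, and the data are generated to be exactly additive, so all estimates coincide by construction.

```python
import math
from statistics import median

RT = 1.987e-3 * 338.0  # kcal/mol at 65 degrees C

def focus_penalty_estimates(i_pm, i_focus, pairs):
    """Independent estimates of the penalty of a focus mismatch.

    i_focus: intensity of the probe carrying only the focus mismatch.
    pairs: list of (I_n, I_mn), where I_n is the intensity of a probe with
    only a second mismatch (more than 4 nt away) and I_mn the intensity of
    the probe carrying both mismatches.
    """
    direct = -RT * math.log(i_focus / i_pm)
    # difference of double and single penalties; I_PM cancels in the ratio
    indirect = [-RT * math.log(i_mn / i_n) for i_n, i_mn in pairs]
    return [direct] + indirect

true_penalty = 2.1                       # kcal/mol, hypothetical value
damping = math.exp(-true_penalty / RT)   # intensity ratio caused by the focus mismatch
i_pm = 1.0e4
pairs = [(i_n, i_n * damping) for i_n in (300.0, 800.0, 1500.0)]
estimates = focus_penalty_estimates(i_pm, i_pm * damping, pairs)
print(median(estimates))
```

With real data the estimates scatter around the median, and a pronounced outlier signals a breakdown of additivity for that particular pair of mismatches.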



C. Boundary effects 

The previous section ended by showing the equivalence
of both approaches to access free energy penalties
of an isolated mismatch, provided the data are restricted
to bulk mismatches. The direct median method of the
previous section can also assess penalties of mismatches
close to the boundary, whereas the fitting
method cannot, by construction. The latter, however, has
the advantage of fitting a full parameter set of the NN
model and as such can easily provide bulk values for the
free energy penalty of any isolated mismatch. The
combination of both methods now provides an elegant way
to assess the effect of boundary proximity on an isolated
mismatch. To this end, we introduce the parameter $\beta$ as the
relative reduction of free energy penalty of a mismatch
when compared to its bulk value.

$$ \beta_x^m = \frac{\Delta\Delta G_x^m}{\Delta\Delta G_{\mathrm{bulk}}^m} \tag{9} $$

In Figure 6 the parameter $\beta$ is shown as a function of
$x$ after averaging over $m$. It is clear that, as expected,
$\beta$ is approximately equal to one in the bulk, whereas
when approaching the boundary, a reduction of free
energy penalty occurs which reaches up to 80%. Note that
for mismatches at the boundary, $x = 1$ and $x = 25$, the
NN model is not applicable and no data are presented.
Figure 6 shows that the range of the boundary effect is
4 nt.
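
The boundary parameter can be computed as in the short Python sketch below. It assumes that the parameter is the ratio of the measured penalty to its bulk value, averaged over the three possible mismatching nucleotides at a given position; whether the original analysis averages the ratios or the penalties is not stated, so this choice, the function name and the numbers are ours.

```python
from statistics import mean

def boundary_ratio(measured, bulk):
    """Average over m of measured penalty / bulk penalty at one position.

    measured, bulk: dicts mapping the mismatching nucleotide m to a
    free energy penalty in kcal/mol.
    """
    return mean(measured[m] / bulk[m] for m in measured)

# Hypothetical penalties for a position next to the helix edge.
measured_edge = {"A": 0.4, "C": 0.5, "T": 0.6}
bulk_values = {"A": 2.0, "C": 2.5, "T": 3.0}
print(boundary_ratio(measured_edge, bulk_values))
```

A value close to one indicates bulk behavior, while a value of 0.2 corresponds to the 80% reduction quoted in the text.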



D. Thermodynamic outliers 

As a final result of this article, we come back to the
two outliers observed in Figure 5(a); the same deviations
are found in replicated experiments at different
concentrations: therefore, they are unlikely to be due to
experimental errors. For these two cases we find
$\Delta\Delta G_{10,15}^{AG} - \Delta\Delta G_{15}^{G} \approx 1.2$ kcal/mol and
$\Delta\Delta G_{10,17}^{AG} - \Delta\Delta G_{17}^{G} \approx 3.1$ kcal/mol,
strongly deviating from the median value (≈ 2.1 kcal/mol).
The common feature of these two sequences is that they
involve GA mismatches. The two sets of mismatches are
arranged in an antiparallel way, i.e. one G and one A are
on the same strand. Mismatches of GA type in DNA and RNA
helices have been the subject of several studies in the
past [13-21]. In RNA folding, it is known that GA pairs
contribute substantially to the RNA helix stability. Their
contribution is comparable to that of a canonical AT pair.
Like AT pairs, GA pairs form two hydrogen bonds, but can
also assume four different conformations [14]. The
microarray data suggest that the antiparallel combination
of GA and AG pairs of mismatches has a long range
interaction effect, which is probably a signature of some
structural conformational change of a double helix
containing these pairs. Next-nearest neighbor effects
extending up to 4 nt distance for antiparallel GA
mismatches have been reported in the case of RNA duplexes
in [19] (longer distances were not considered in that
case). We investigated antiparallel GA and AG pairs of
mismatches also in sequences $t_2$ and $t_3$, but found no
anomalous behavior in those cases. This suggests that the
nucleotide sequence between the two GA/AG pairs plays an
important role in the overall stability of the duplex.

FIG. 5. Free energy penalty $\Delta\Delta G_{10}^{A}$ for focus mismatch $(m = A, x = 10)$ derived from experimental intensities according to
Equation (8) as a function of the location $x + \Delta x$ of the second mismatch $(n, x + \Delta x)$. For each $|\Delta x| > 4$ the three values,
one per possible mismatch, are indicated by the letter representing the mismatching nucleotide $n$ of the probe. The target
sequence is written on top of the x-axis in 3'-5' notation, $t_1$ in the left pane, $t_2$ in the right pane. The dotted line corresponds to the
median value of the 48 estimates. The circled point is the estimate without second mismatch coming from Equation (4). For
this particular mismatch, the free energy penalty for both $t_1$ and $t_2$ is identical and corresponds to
$\Delta\Delta G\begin{pmatrix}\mathrm{G\underline{A}T}\\ \mathrm{C\underline{G}A}\end{pmatrix}$.

FIG. 6. Boundary effect: $\beta$, the relative reduction of mismatch
free energy penalty, as a function of location for the
experiment with target $t_3$. Each point is the average of three
estimates, one per possible mismatch. Data are absent for the
extremal locations $x = 1$ and $x = 25$, since no value can be
calculated by the NN model.

As a further proof of the outlying behavior of antiparallel
GA/AG pairs we show in Figure 7 a plot of free energy
penalties for tandem mismatches (neighboring double
mismatches). These are again obtained from Equation (6)
for different $m$ and $n$ mismatching nucleotides,
where in the case of tandem mismatches $\Delta x$ is equal to
1. At each location of the sequence our data set contains
nine different types of tandem mismatch. A clear
boundary effect is noticeable, but when looking at the
bulk data points, tandem mismatches of the type GA/AG
are again outlying: they appear to be particularly stable,
with a free energy penalty ≈ 2 kcal/mol below average.

FIG. 7. The free energy penalty of tandem mismatches, from the
experiment with target $t_1$:
$\Delta\Delta G\begin{pmatrix} x\,m\,n\,y \\ x'\,a\,b\,y' \end{pmatrix}$,
where $x'$ and $y'$ are complementary to $x$ and $y$ respectively,
$ab$, denoted above the x-axis, are the fixed nucleotides in the
target, $mn$ is a tandem mismatch in the probe and the vertical
position of these letters in the plot gives the associated free
energy penalty. Note the low free energy penalty for
$\begin{pmatrix}\text{-GA-}\\ \text{-AG-}\end{pmatrix}$ mismatches (encircled).
IV. DISCUSSION AND CONCLUSION 

In this article, we have analyzed DNA hybridization re- 
actions in microarrays and quantified free energy penal- 
ties of single and double mismatches. We have shown 
that the experimental data are very precise and repro- 
ducible. The microarray data follow an equilibrium 
isotherm over a range of four orders of magnitude in the 
fluorescence intensities and allow the extraction of accu- 
rate thermodynamic parameters. First, the analysis pro- 
vides a database with a large number of NN parameters 
for isolated mismatches. These parameters correlate well 
with those reported in the literature from hybridization 
experiments in solution. Second, the experiments con- 
tain systematic measurements of hybridization with two 
mismatches, which allowed us to probe the validity limit 
of the NN approximation. We showed that when two 
mismatches are separated by a distance of ≥ 5 nt their 
effect is additive, allowing a standard approach with the 
NN model. However, for shorter distances, the additiv- 
ity is no longer valid and we found that duplexes with 
neighboring mismatches are more stable than expected 
from additivity. This interaction was shown to decay 
exponentially as a function of the distance between mis- 
matches. Further, we investigated the behavior of mis- 
matches close to the helix edges, and showed that their 
free energy penalty is reduced up to 80% when compared 
to the bulk behavior. The boundary effect was observ- 
able up to 4 nt from the helix edge. Finally, we also 
found some thermodynamic outliers, sequences involving 
two antiparallel GA mismatches, in which the mismatch 
interaction appears to persist beyond 5 nt. These outliers
were not related to experimental error, which suggests a
signature of some structural conformational change of a
double helix containing these mismatch pairs.

Overall, the analysis of the microarray data reported
in this article provides new quantitative insights into the
DNA hybridization parameters, the NN model and its
present limitations. Our study is in line with a number of
recent articles, which have been dedicated to the
investigation of fundamental physico-chemical properties of
DNA arrays [22-31]. Due to the relevance of hybridization
in many technologies, going from PCR [1] to recent
developments in biosensors, e.g. [32], a good thermodynamic
model is also important from the application point
of view. A precise quantification of the interaction free
energies involved in hybridization will help to increase the
accuracy of microarrays and other hybridization-based
technologies, so that these devices can realize their full
potential, for instance, for clinical applications [33]. For
these applications, an increase in specificity and sensitivity
is very important and can be achieved through a better
understanding of the fundamental properties of hybridization
in these devices.

There has been considerable attention in recent
years [3, 12] to understanding the fundamentals
of hybridization in DNA microarrays and its impact on
data analysis. Here, we have shown that microarrays are
a reliable and high-throughput tool to gain insight into
DNA hybridization thermodynamics. The same method
could be used to screen other types of defects, such as bulges.
Indeed, it was recently used for understanding loop
conformations [22].
Supplementary Data available in Appendix. 
Appendix A: Nearest neighbor model and linear 
regression 

According to the nearest-neighbor model, the total
hybridization free energy of a target to a probe can be
expressed as a sum of the dinucleotide parameters $\Delta G_\alpha$
accounting for hydrogen bonding and stacking interactions.
The index $\alpha$ covers all possible dinucleotide parameters.
Some examples are:

$$
\Delta G\begin{pmatrix}5'\text{-AT-}3'\\ 3'\text{-TA-}5'\end{pmatrix}, \qquad
\Delta G\begin{pmatrix}5'\text{-AC-}3'\\ 3'\text{-TG-}5'\end{pmatrix}, \qquad
\Delta G\begin{pmatrix}5'\text{-AA-}3'\\ 3'\text{-}\quad\text{-}5'\end{pmatrix}
\tag{A1}
$$

where the underlined nucleotides indicate mismatches.
In total there are 10 perfect match parameters (taking
into account symmetries) and 48 parameters in the case
of a single mismatch. These dinucleotide parameters are
known not to be unique, see e.g. [12].

Thermodynamics predicts that the intensity $I_i$ measured
from a spot $i$ is given by:

$$ I_i = I_0 + A c \, e^{-\Delta G_i / RT} \tag{A2} $$

where $\Delta G_i$ is the total hybridization free energy between
a target and a probe, $A$ is a parameter which sets the
intensity scale, $c$ the target concentration, $R$ the gas
constant and $T$ the temperature. $I_0$ is the aspecific signal
that can be considered as background. In this paper the
stability of duplexes was always compared to that of the
perfect match, i.e.

$$ \ln I_i - \ln I_{PM} = -\frac{\Delta G_i - \Delta G_{PM}}{RT} \tag{A3} $$

which defines the free energy penalty of probe $i$ with
respect to the perfectly matching probe. This penalty can
be expressed as a sum of nearest-neighbor dinucleotide
parameters:



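$$ y_i = -\frac{1}{RT} \sum_{\alpha=1}^{58} X_{i\alpha} \, \Delta G_\alpha \tag{A4} $$

where $y_i = \ln I_i - \ln I_{PM}$ and $X_{i\alpha}$ is the frequency
matrix, which counts the number of times a given dinucleotide
term contributes to $y_i$. As an example, for an isolated
mismatch of type GA we have

$$
\Delta\Delta G_i
= \Delta G\begin{pmatrix}\ldots\mathrm{G\underline{A}T}\ldots\\ \ldots\mathrm{C\underline{G}A}\ldots\end{pmatrix}
- \Delta G_{PM}\begin{pmatrix}\ldots\mathrm{GCT}\ldots\\ \ldots\mathrm{CGA}\ldots\end{pmatrix}
= \Delta G\begin{pmatrix}\mathrm{G\underline{A}}\\ \mathrm{C\underline{G}}\end{pmatrix}
+ \Delta G\begin{pmatrix}\mathrm{A\underline{G}}\\ \mathrm{T\underline{A}}\end{pmatrix}
- \Delta G\begin{pmatrix}\mathrm{GC}\\ \mathrm{CG}\end{pmatrix}
- \Delta G\begin{pmatrix}\mathrm{CT}\\ \mathrm{GA}\end{pmatrix}
\equiv \Delta\Delta G\begin{pmatrix}\mathrm{G\underline{A}T}\\ \mathrm{C\underline{G}A}\end{pmatrix}
\tag{A5}
$$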



For notational convenience we used, by symmetry, the 
equality of AG (g^) = (ta) l^^'^^ the mismatch 
on the right hand side of the dinucleotide. For any given 
I, the matrix elements Xia are all zero except for the four 



dinucleotide terms of Equation ( A5 1 which contribute by 
-1-1 for the two dinucleotides with mismatches and —1 for 
the two perfect matching dinucleotides. Equation (A4| 



defines a multiple linear regression, from which the 58 
dinucleotide parameters can be fitted to match all the 
observed free energy penalties of mismatches. Note that 
it defines the dinucleotide parameters not in a unique 
way, e.g. the following transformation 



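$$\Delta G\binom{x\underline{A}}{x'\underline{G}} \to \Delta G\binom{x\underline{A}}{x'\underline{G}} + \varepsilon \tag{A6}$$

$$\Delta G\binom{x\underline{G}}{x'\underline{A}} \to \Delta G\binom{x\underline{G}}{x'\underline{A}} - \varepsilon \tag{A7}$$

in which the same constant $\varepsilon$ is added to and subtracted from 
different dinucleotide parameters, leaves Equation (A5) 
invariant. The triplet parameters, such as the one defined in 
the last line of Equation (A5), are however unique, as 
expected, since they are directly physically accessible. 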
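
As an illustration of Equations (A5)-(A7), the following minimal Python sketch builds the four non-zero entries of one row of the frequency matrix for the triplet example above and checks that a shift of the type (A6)-(A7) leaves the triplet penalty unchanged. The numerical free energies, the dictionary layout, the choice of which term receives $+\varepsilon$, and the helper names (`flip`, `frequency_row`, `penalty`, `gauge_shift`) are assumptions made for this sketch; they are not values or code from the experiments.

```python
# Dinucleotide keys are (top strand 5'->3', bottom strand 3'->5').
# All free energies below are invented for illustration (kcal/mol).
params = {
    ("GA", "CG"): -0.6,  # G.C pair followed by an A.G mismatch
    ("AG", "TA"): -0.1,  # AT/GA rewritten with the mismatch on the right
    ("GC", "CG"): -2.2,  # perfect match
    ("CT", "GA"): -1.3,  # perfect match
}


def flip(top, bottom):
    """Read the same dinucleotide from the opposite strand."""
    return bottom[::-1], top[::-1]


def frequency_row(table, mm_top, pm_top, bottom):
    """Non-zero entries of one row of X for an isolated mismatch in a triplet."""
    row = {}
    for k in range(2):  # the two dinucleotides that overlap the central base
        for top, sign in ((mm_top, +1), (pm_top, -1)):
            key = (top[k:k + 2], bottom[k:k + 2])
            if key not in table:  # use the strand-exchange symmetry
                key = flip(*key)
            row[key] = row.get(key, 0) + sign
    return row


def penalty(table, row):
    """Free energy penalty: sum over alpha of X_{i alpha} * dG_alpha."""
    return sum(coeff * table[key] for key, coeff in row.items())


def gauge_shift(table, eps):
    """Add eps to xA/x'G terms and subtract it from xG/x'A terms."""
    shifted = {}
    for (top, bottom), value in table.items():
        if top[1] == "A" and bottom[1] == "G":
            value += eps
        elif top[1] == "G" and bottom[1] == "A":
            value -= eps
        shifted[(top, bottom)] = value
    return shifted


row = frequency_row(params, mm_top="GAT", pm_top="GCT", bottom="CGA")
ddg = penalty(params, row)
ddg_shifted = penalty(gauge_shift(params, eps=0.37), row)
print(row)
print(ddg, ddg_shifted)
```

The two printed penalties agree up to rounding: the individual dinucleotide values change under the shift, while the triplet combination does not.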



Appendix B: Target sequence selection with 
Optimal design 

As discussed above, the dinucleotide parameters can 
be obtained from a linear fit from N independent experi- 
mental measurements. Such an approach always contains 
some uncertainties. These uncertainties can be lowered 
if one takes N large. In our specific case N equals the 
number of spots on the microarrays, and can be increased 
by combining data from more arrays (see main paper for 
experimental setup). Further, for a given fixed value of 
N one can use some optimization criterion to select the 
best N measurements which minimize the uncertainties 
on fitted parameters. In our case this comes down to the 
selection of a target sequence with good statistical prop- 
erties. The theory of Optimal Design establishes some 
criteria for this purpose and we briefly discuss this 
theory here. 

Before entering into the details of the optimization 
followed in the microarray experiment we discuss a 
one-dimensional example, which illustrates the optimization 
method. Let us take the example of a simple linear 
regression with the intercept set to zero (corresponding to a 
one-dimensional system): 

$$y_i = \beta x_i \tag{B1}$$

where $\beta$ is the unknown of the problem, and $x_i$ and $y_i$ are 
respectively the input and output of experiment $i$ 
and can take any real value. The parameter $\beta$ can be 
obtained by the least squares method: 



$$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \tag{B2}$$

where the overline denotes the average over the $N$ 
elements. The error on $\beta$ is given by: 



$$\Delta\beta = \sqrt{\frac{S}{N-1}}\,\frac{1}{\sqrt{\sum_i (x_i - \bar{x})^2}} \tag{B3}$$



where $S$ is the cost function of the system. Equation (B3) 
implies that the error can be decreased by increasing the 
number of sampled points $N$ or, for fixed $N$, by increasing 
the variance of the variable $x_i$. The latter criterion can be 
used in the design of the experiment by performing 
measurements $y_i$ for a well spread set of points $x_i$. Indeed, it is 
intuitively clear that when the $x_i$ are very close to each other 
(small variance) one has a large uncertainty on the 
estimate of the slope $\beta$. In what follows we discuss 
optimal design criteria in higher dimensions, which roughly 
correspond to the idea of maximizing the variance 
in the previous one-dimensional example. 
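
The one-dimensional argument can be checked numerically. The sketch below assumes the textbook ordinary least squares expressions for the slope and its standard error (with a fitted intercept and $N-2$ degrees of freedom, which may differ in detail from Equation (B3)) and fits the same noisy line on a narrow and on a wide set of points $x_i$. The slope, the noise level and the sample positions are arbitrary choices made for illustration.

```python
import numpy as np


def slope_and_error(x, y):
    """Least squares slope and its standard error (textbook OLS formulas)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    sxx = np.sum(dx ** 2)
    beta = np.sum(dx * dy) / sxx
    s = np.sum((dy - beta * dx) ** 2)  # cost function: sum of squared residuals
    return beta, np.sqrt(s / (len(x) - 2) / sxx)


rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.1, size=20)  # same noise for both designs
true_beta = 1.5                        # arbitrary slope for the simulation

x_narrow = np.linspace(0.9, 1.1, 20)   # small variance of x
x_wide = np.linspace(0.0, 2.0, 20)     # ten times larger spread

beta_narrow, err_narrow = slope_and_error(x_narrow, true_beta * x_narrow + noise)
beta_wide, err_wide = slope_and_error(x_wide, true_beta * x_wide + noise)
print(err_narrow, err_wide)
```

Because the noise is identical in the two designs, the residuals are the same and the error on the slope scales inversely with the spread of the $x_i$: the wide design gives an error ten times smaller.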

We first define the so-called information matrix 
$M = X^T X$, where $X$ is the frequency matrix defined in 
Equation (A4) and where $X^T$ denotes its transpose. In terms 
of matrix elements: 

$$M_{\alpha\beta} = \sum_{i=1}^{N} X_{i\alpha} X_{i\beta} \tag{B4}$$

which is thus in our case a square symmetric matrix of 
dimension $58 \times 58$. 

The information about the quality of the experimental 
design is encoded in $M$ and in our case is defined by the 
sequence of the target oligo in the experiment (see main 
paper for experimental setup). The three most used 
criteria in optimal design are A-, D- and E-optimality. 
A-optimality corresponds to minimizing the trace of $M^{-1}$, 
D-optimality corresponds to minimizing the determinant 
of $M^{-1}$ and E-optimality corresponds to maximizing the 
lowest eigenvalue of $M$. Roughly speaking, these 
strategies amount to maximizing the information encoded in $M$ 
[5]. We note that in the linear problem of Equation (A4) 
the information matrix has a minimum of 7 null 
eigenvalues (see the supplementary material of Ref. [B,, for a 
detailed explanation). These come from unavoidable 
degeneracies of the problem, or equivalently from the fact 
that the dinucleotide parameters are not unique (see e.g. 
Equations (A6) and (A7)). Since it has some zero 
eigenvalues, the information matrix $M$ is not invertible; 
we therefore work with the pseudo-inverse, which is obtained 
from the singular value decomposition of $M$. 

FIG. 8. Sketch showing the non-linear behaviour due to the 
detection limit at the low end and saturation at the high end. 
[Figure labels: theoretical intensity; low concentration; 
high concentration; detection limit; saturation.] 
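
The three criteria can be evaluated from the eigenvalues of $M$. The sketch below is one possible way to do this for a singular information matrix: it discards the null eigenvalues and evaluates the trace and the determinant on the remaining subspace, which for the trace coincides with using the pseudo-inverse. Restricting the determinant to the non-null subspace, the relative tolerance, and the function name `design_criteria` are assumptions of this sketch; the toy matrix is not a real frequency matrix.

```python
import numpy as np


def design_criteria(X, rel_tol=1e-10):
    """A-, D- and E-type scores of a design, ignoring null directions of M."""
    M = X.T @ X                                   # information matrix
    eigvals = np.linalg.eigvalsh(M)               # real, ascending (M symmetric)
    nonzero = eigvals[eigvals > rel_tol * eigvals.max()]
    return {
        "n_null": int(eigvals.size - nonzero.size),
        "A": float(np.sum(1.0 / nonzero)),        # trace of the pseudo-inverse
        "logD": float(-np.sum(np.log(nonzero))),  # log pseudo-determinant of it
        "E": float(nonzero.min()),                # lowest non-zero eigenvalue
    }


# Toy frequency matrix with one built-in degeneracy (columns 0 and 1).
X = np.array([[1.0, -1.0, 0.0],
              [0.0, 0.0, 2.0]])
scores = design_criteria(X)
print(scores)
```

For this toy matrix the eigenvalues of $M$ are 0, 2 and 4, so there is one null direction. Lower `A` and `logD` and a higher `E` indicate a more informative design.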

The three target sequences, $t_1$, $t_2$ and $t_3$, which were 
used for the experiments and which are listed in 
Table 1 of the main article, were selected as follows. We 
collected a set of candidate targets by scanning over a 
piece of the human genome and taking subsequences of 
length 25. The first criterion was to choose sequences 
with the minimum, unavoidable, number of 7 zero 
eigenvalues, in order to get the minimum number of degeneracies 
when solving the linear system to estimate the 
nearest-neighbor parameters, as discussed above. For $t_1$, we 
considered a subset of sequences with a minimum distance of 
3 nucleotides from the border and a minimum distance of 
3 nucleotides between 2 mismatches. For $t_2$ and $t_3$, the 
minimal distance from the border is 4 nucleotides and the 
distance between 2 mismatches is at least 5 nucleotides. 
The constraint on the subset used to select $t_1$ was therefore 
weaker than that for $t_2$ and $t_3$. Because the constraint for 
$t_2$ and $t_3$ is stronger, the number of equations 
in the linear system is lower and it is more difficult to 
find subsequences of length 25 which display the 
minimum number of 7 zero eigenvalues. For the same 
computational effort, we managed to find 130 sequences for $t_1$ 
and only a few sequences for $t_2$ and $t_3$. For $t_1$, this set 
of candidates was subsequently ranked according to the 
three optimal design criteria A, D and E. Finally, the 
candidate targets which ended up top-ranked on all 
three criteria were retained. Moreover, we checked the 
energy for the target to fold on itself. For the 3 targets, 
it takes a reasonable value. 
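
The selection procedure can be summarised in a short sketch: enumerate fixed-length windows of a longer sequence, then keep the candidates that rank among the best on all three criteria. The toy sequence, the placeholder scores, the reading of "top-ranked" as "within the best `keep` candidates", and the function names are assumptions of this sketch. In practice the scores would be computed from the information matrix of each candidate.

```python
def windows(sequence, length=25):
    """All contiguous subsequences of the given length."""
    return [sequence[k:k + length] for k in range(len(sequence) - length + 1)]


def top_on_all_criteria(scores, keep):
    """Candidates that are among the `keep` best for A, D and E simultaneously.

    `scores` maps a candidate to (A, D, E); lower A and D are better,
    higher E is better.
    """
    best_a = sorted(scores, key=lambda s: scores[s][0])[:keep]
    best_d = set(sorted(scores, key=lambda s: scores[s][1])[:keep])
    best_e = set(sorted(scores, key=lambda s: -scores[s][2])[:keep])
    return [s for s in best_a if s in best_d and s in best_e]


candidates = windows("ACGTTGCAAC", length=8)  # toy sequence, not genomic data
# Placeholder scores, invented for illustration only.
scores = {
    candidates[0]: (0.5, -3.0, 2.0),
    candidates[1]: (0.4, -3.5, 2.5),
    candidates[2]: (0.9, -1.0, 0.5),
}
selected = top_on_all_criteria(scores, keep=2)
print(selected)
```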



Appendix C: The linear regime 

As a measurement device, the microarray technology 
is faced with a detection limit in the low measurement 
regime and saturation at the high end: see the sketch in 
Figure 8. In our research we want to limit ourselves to 
measurements in the linear regime. To assess which data 
meet this requirement we combine experiments which are 
identical (identical target sequence, identical probe sets, 
identical hybridisation conditions) except for the 
concentration $c_n$ of the target. If the data are in the linear regime 
we expect the intensity of a spot $i$ of the experiment with 
target concentration $c_n$ to be 

$$I_i(c_n) \propto c_n \exp(-\Delta G_i/RT). \tag{C1}$$

If we now combine two experiments, one with target 
concentration $c_n$ and one with $c_{n+1} > c_n$, and define for each 
spot $i$ the quantity $R$ as 

$$R_i(c_n, c_{n+1}) = \frac{I_i(c_n)/c_n}{I_i(c_{n+1})/c_{n+1}} \tag{C2}$$

then we expect $R$ to be equal to 1 when both intensities 
are in the linear regime. However, for low $c_n$ the 
spot intensity $I_i(c_n)$ can be close to the detection limit and 
consequently be higher than predicted by the theory of 
Equation (C1), or for high $c_{n+1}$ the intensity $I_i(c_{n+1})$ 
can be close to saturation and consequently lower than 
theoretically expected. In both cases $R$ will be above 
one. The result of this analysis is shown in Figure 9 
for the combinations ($c_1$ = 20 pM, $c_2$ = 100 pM) and 
($c_2$ = 100 pM, $c_3$ = 500 pM) of target $t_2$. From this 
picture it is clear that for a large part of the intensity 
range $R$ equals one, which supports the linear regime. For 
the green dots, there is a deviation in the high intensity 
range due to the proximity of saturation of these spots in 
the 500 pM experiment. For the red dots a deviation is 
present due to the proximity of the detection limit for these 
spots in the 20 pM data. This approach gives a criterion 
to assess the validity of the linear regime per spot and 
the possibility to make a correction for the non-linear 
behaviour close to saturation or the detection limit. 

FIG. 9. Plot of $R_i(c_n, c_{n+1})$, $c_{n+1} > c_n$, as a function of 
the intensity $I_i(c_n)$. The target concentrations used are $c_1$ = 
20 pM, $c_2$ = 100 pM, $c_3$ = 500 pM of target $t_2$. 



Appendix D: Free energy additivity of mismatches 

In the main article the additivity of free energy 
penalties of mismatches was shown when mismatches were 
separated by more than four nucleotides. For two examples, 
this was explicitly shown in Figure 5 of the main article. 
In this section we add some further examples of the 
additivity with similar plots. These are shown in Figure 10. 
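
The linear-regime criterion of Appendix C, based on the ratio $R$ of Equation (C2), can be sketched in code as follows. The sketch assumes that $R$ is the ratio of concentration-normalised intensities, which equals 1 in the linear regime and exceeds 1 near the detection limit or near saturation, as described in the text. The intensities, the tolerance on $R$ and the function name `linearity_ratio` are invented for illustration.

```python
import numpy as np


def linearity_ratio(i_low, i_high, c_low, c_high):
    """Ratio R of concentration-normalised intensities for each spot."""
    i_low = np.asarray(i_low, dtype=float)
    i_high = np.asarray(i_high, dtype=float)
    return (i_low / c_low) / (i_high / c_high)


# Invented intensities for three spots measured at 20 pM and 100 pM:
# an ideal spot, a spot saturating at 100 pM, a spot near the floor at 20 pM.
i_20 = [200.0, 10000.0, 30.0]
i_100 = [1000.0, 40000.0, 50.0]

R = linearity_ratio(i_20, i_100, c_low=20.0, c_high=100.0)
in_linear_regime = np.abs(R - 1.0) < 0.1  # tolerance is an arbitrary choice
print(R, in_linear_regime)
```

Only the first spot passes the test; the other two give $R > 1$ and would be excluded or corrected.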



Appendix E: Self-consistency in free energy 
penalties estimation of triplet nucleotides 

In the main article we present two different approaches 
that can be used to estimate free energy penalties of 
single mismatches in a triplet of nucleotides, such as in 
Equation (3) of the main article. The first method, linear 
fitting, produces a robust estimate provided that 
each of the 58 NN dinucleotide parameters is equally 
well represented. This was achieved by using Optimal 
Design principles in designing the experiments. The 
second method takes the median of data points 
obtained from ratios of intensities, following Equations (4)-(6) of 
the main article. Figure 10 of this document shows six of 
these unique triplets, in which the free energy penalties 
are indicated by the horizontal line obtained by taking the 
median of the independent estimates. It is then imperative 
to check whether these two methods are equivalent in providing the 
estimates. Figure 11 shows that the free energy penalties 
calculated from the two methods are well correlated, 
with a Pearson correlation of 0.966 (as mentioned in the 
main article). This indicates the equivalence of the 
two methods and shows that our experiments 
are self-consistent from the different perspectives of these 
two approaches. 

FIG. 10. A few examples of different focus mismatches showing additivity for $|\Delta x| > 4$. Similar to Figure 5 in the main paper, 
the target shown on top of the x-axis is in 3' to 5' orientation; $t_1$ is on the left side, $t_2$ is on the right side. 

FIG. 11. Comparison of estimates of free energy penalties 
for isolated mismatches, obtained in two different ways: from 
the linear model fit and from the median of independent 
estimates. The two sets of data are strongly correlated (Pearson's 
correlation 0.966). 
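
The comparison between the two estimation methods amounts to taking, for each triplet, the median of its independent estimates and correlating the medians with the values from the linear fit. A minimal sketch is given below; all numbers and triplet labels are invented for illustration and are not data from the experiments.

```python
import numpy as np

# Invented numbers: several independent estimates per triplet (in units of RT).
independent = {
    "triplet_1": [1.0, 1.2, 1.1],
    "triplet_2": [2.0, 2.4, 2.1],
    "triplet_3": [3.1, 2.9, 3.0],
}
# Invented penalties for the same triplets from the linear fit.
from_fit = {"triplet_1": 1.0, "triplet_2": 2.2, "triplet_3": 3.1}

names = sorted(independent)
medians = np.array([np.median(independent[n]) for n in names])
fitted = np.array([from_fit[n] for n in names])

# Pearson correlation between the two sets of estimates.
pearson_r = np.corrcoef(medians, fitted)[0, 1]
print(medians, pearson_r)
```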



